knitr::opts_chunk$set(fig.width=12, fig.height=8, width = 200,
echo=FALSE, warning=FALSE, message=FALSE)
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
This report explores the Red Wine Quality data set, which contains quality ratings and attributes for 1599 red wines.
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
The summary suggests that fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide and sulphates contain outliers, because for each of them the maximum lies far above the 3rd quartile.
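One way to make the "maximum far above the 3rd quartile" observation concrete is Tukey's 1.5 * IQR rule. The sketch below is a minimal illustration, not the code used for this report: it re-enters the quartiles and maxima from the summary above by hand into a hypothetical data frame `qs` and measures how far each maximum sits above the upper fence.

```r
# Quartiles and maxima copied by hand from the summary() output above.
# The data frame name `qs` and its column names are illustrative assumptions.
qs <- data.frame(
  attribute = c("fixed.acidity", "volatile.acidity", "citric.acid",
                "residual.sugar", "chlorides", "free.sulfur.dioxide",
                "total.sulfur.dioxide", "sulphates"),
  q1  = c(7.10, 0.39, 0.09, 1.90, 0.07, 7, 22, 0.55),
  q3  = c(9.20, 0.64, 0.42, 2.60, 0.09, 21, 62, 0.73),
  max = c(15.90, 1.58, 1.00, 15.50, 0.611, 72, 289, 2.00)
)

# Tukey's rule: values above Q3 + 1.5 * IQR are flagged as potential outliers.
qs$iqr         <- qs$q3 - qs$q1
qs$upper_fence <- qs$q3 + 1.5 * qs$iqr
qs$max_flagged <- qs$max > qs$upper_fence

# Distance of the maximum above Q3, measured in IQR units.
qs$max_in_iqrs <- (qs$max - qs$q3) / qs$iqr

print(qs[order(-qs$max_in_iqrs),
         c("attribute", "upper_fence", "max", "max_in_iqrs")])
```

With the full data frame available, `boxplot.stats(x)$out` applies a closely related rule (based on hinges rather than quartiles) directly to the raw values.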
Most red wines are in the middle range of quality.
From the above two histograms, we can see that fixed.acidity and volatile.acidity are almost normally distributed. Meanwhile, according to the dataset introduction, a high level of volatile acidity leads to an unpleasant, vinegar taste, which I suspect is an important factor affecting the quality judgement.
The histogram of citric.acid on the first row looks skewed; after a log transform, the peak is more obvious. One possible explanation is that citric acid is found in small quantities and the bin width is not small enough to separate the different values. Also, as described, citric acid can add ‘freshness’ and flavor to wine, which may also be an important factor for the quality rating.
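The effect of bin width on a variable recorded to two decimals can be sketched as follows. The values are simulated stand-ins (the vector `citric_sim` is an assumption, not the wine data), and the comparison uses base `hist()` with `plot = FALSE` so that only the bin counts are computed. The last lines show why exact zeros need care before a log transform.

```r
set.seed(1)
# Simulated stand-in for citric.acid: values in [0, 1] rounded to two decimals.
citric_sim <- round(rbeta(1599, shape1 = 1, shape2 = 2.7), 2)

# Coarse bins (width 0.1) versus fine bins (width 0.01).
h_coarse <- hist(citric_sim, breaks = seq(0, 1, by = 0.1),  plot = FALSE)
h_fine   <- hist(citric_sim, breaks = seq(0, 1, by = 0.01), plot = FALSE)

length(h_coarse$counts)  # number of coarse bins
length(h_fine$counts)    # number of fine bins

# log10(0) is -Inf, so exact zeros fall outside any finite log-scale axis.
n_zero <- sum(citric_sim == 0)
is.infinite(log10(0))
```

Since citric.acid has a minimum of 0 in the summary, any zero values would be dropped from a log-scaled histogram unless a small offset is added first.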
The distribution of residual.sugar is close to normal but has some very distinct outliers.
The distribution of chlorides is similar to that of residual.sugar. Most red wines have low levels of residual sugar and chlorides.
The distributions of free.sulfur.dioxide and total.sulfur.dioxide are both skewed with distinct outliers, and both are close to normal after a log transform.
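To check whether a log transform brings a right-skewed variable closer to symmetric, the sample skewness can be compared before and after. The sketch below uses simulated log-normal values as a stand-in for a sulfur dioxide column, and `skewness()` is a small helper defined here for illustration, not a function from the report.

```r
set.seed(2)
# Simulated right-skewed stand-in (log-normal); not the wine measurements.
so2_sim <- rlnorm(1599, meanlog = log(38), sdlog = 0.7)

# Moment-based sample skewness: roughly 0 for a symmetric distribution.
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

skew_raw <- skewness(so2_sim)
skew_log <- skewness(log10(so2_sim))

c(raw = skew_raw, log10 = skew_log)
```

A skewness near zero after the transform supports the "close to normal" reading; a value that stays clearly positive, as seems to be the case for alcohol, suggests the transform does not help.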
The density histogram is also close to a normal distribution, within a narrow range near the density of water.
All pH values are well below 7, and they are approximately normally distributed.
Sulphates are almost normally distributed, with a few outliers.
The histogram of alcohol looks skewed, and neither a log transform nor a sqrt transform improves it much.
There are 1599 red wines in the dataset, each with 12 attributes including the output attribute (quality), plus a sequential row number (X).
The main feature of interest is quality; I would like to explore which attributes in this dataset affect the quality grading of red wine.
The background knowledge given in the dataset info makes me want to pay more attention to volatile.acidity and citric.acid. Meanwhile, the univariate plots show that several attributes have a few very distinct outliers, which may be related to the fact that only a small number of red wines have very low or very high quality.
No.
Several distributions are skewed and look better after a log transform, except for alcohol. Some distributions also have very distinct outliers.
From the above correlation matrix, we can notice that:
quality
1. quality positively correlates with alcohol
2. quality negatively correlates with volatile.acidity

acidity
1. fixed.acidity positively correlates with citric acid
2. fixed.acidity negatively correlates with pH
3. volatile.acidity negatively correlates with citric acid
4. citric acid negatively correlates with pH

density
1. fixed.acidity positively correlates with density
2. alcohol negatively correlates with density

The attributes mentioned above may be related to the quality rating, while free.sulfur.dioxide and total.sulfur.dioxide can be excluded.
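A correlation matrix like the one discussed above can be computed with base `cor()`. The sketch below runs on simulated columns (`wine_sim` and the coefficients used to generate it are assumptions chosen only to mimic the signs reported here) and ranks the attributes by the strength of their correlation with quality.

```r
set.seed(42)
n <- 1599
# Simulated stand-in data; the generating coefficients are arbitrary assumptions.
wine_sim <- data.frame(
  alcohol          = rnorm(n, mean = 10.4, sd = 1.1),
  volatile.acidity = rnorm(n, mean = 0.53, sd = 0.18),
  unrelated        = rnorm(n)
)
wine_sim$quality <- 0.3 * wine_sim$alcohol -
  1.0 * wine_sim$volatile.acidity + rnorm(n, sd = 0.6)

# Pearson correlation matrix of all numeric columns.
cor_mat <- cor(wine_sim)

# Correlations with quality, strongest first (dropping quality itself).
cor_quality <- cor_mat[, "quality"]
cor_quality <- cor_quality[names(cor_quality) != "quality"]
cor_quality[order(-abs(cor_quality))]
```

On the real data, the index column X would be dropped before calling `cor()`, since it is only a row number.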
To better understand how those variables correlate with quality, we can create a boxplot and a frequency polygon for each level of quality. Since quality is an integer with a limited number of values, we can convert it to a factor for plotting.
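The conversion can be sketched on the ten rows shown in the structure listing at the top of the report. The data frame name `wine_head` is an assumption; the values are the first ten alcohol and quality entries printed there.

```r
# First ten rows of alcohol and quality, copied from the str() output.
wine_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Treat quality as a categorical variable so each level gets its own box.
wine_head$quality_f <- factor(wine_head$quality)
levels(wine_head$quality_f)

# Median alcohol per quality level: the centre line of each box.
med_alcohol <- tapply(wine_head$alcohol, wine_head$quality_f, median)
med_alcohol

# The corresponding plot (base graphics):
# boxplot(alcohol ~ quality_f, data = wine_head)
```

Keeping the original integer column next to the factor, as done here, avoids having to convert back later when a numeric response is needed.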
It seems that higher alcohol content goes with better quality only when quality >= 6; this positive relationship does not hold for lower-quality wines. Meanwhile, quality 5 wines span a larger range of alcohol content, with many outliers.
From the two figures above, we can see that quality generally increases as volatile.acidity decreases, which is in accordance with their negative correlation. It also reflects the dataset info: “the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste”. However, there are a few exceptions that may not fit this conclusion:
The outliers among high-quality (7, 8) red wines have the same level of volatile.acidity as low-quality (3, 4) red wines.
Some quality 7 wines have even lower volatile.acidity than quality 8 wines.
Next, we continue to explore the relationships among the attributes other than quality.
The above four plots present the relationships among the acidity-related attributes. pH has a strong negative correlation with both fixed.acidity and citric.acid, which makes sense, so we can ignore pH in further analysis.
From the above two plots we can see that density is highly correlated with fixed.acidity. Meanwhile, the fact that alcohol is less dense than water explains the negative correlation between density and alcohol.
At first I found it curious that volatile.acidity has a negative correlation with fixed.acidity and a positive correlation with pH, but after reading more of the dataset info, it makes sense.
Positive relationship
- quality vs alcohol
- fixed.acidity vs citric.acid

Negative relationship
- quality vs volatile.acidity
- volatile.acidity vs citric.acid

From this plot, it is hard to find a relationship with quality. Most points are of quality 5 and 6, so let’s look back at the dataset: levels 3, 4 and 8 have insufficient data for analysis. I will therefore remove them and focus only on the levels with larger numbers of samples.
## 3 4 5 6 7 8
## 10 53 681 638 199 18
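The counts above are enough to reproduce the subsetting step. In this sketch the quality column is rebuilt from those counts (the names `quality_all`, `keep_levels` and `quality_sub` are assumptions), and only the rows for levels 5 to 7 are kept.

```r
# Rebuild the quality column from the counts in the table above.
quality_all <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))
table(quality_all)

# Keep only the levels with enough samples.
keep_levels <- 5:7
quality_sub <- quality_all[quality_all %in% keep_levels]

length(quality_all)  # total number of wines
length(quality_sub)  # wines remaining after dropping levels 3, 4 and 8
```

The remaining 681 + 638 + 199 = 1518 rows match the N reported for the fitted models later in the report.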
From the plots above, the wines of different quality are widely dispersed. Scale transforms such as sqrt or log10 do not improve matters much, which makes it difficult to define a linear relationship.
Since it is not easy to find an appropriate scale for these attributes, I will build the linear model using the untransformed values of the most important attributes found in the previous analysis: alcohol, volatile.acidity, citric.acid and fixed.acidity.
##
## Calls:
## m1: lm(formula = as.numeric(quality) ~ alcohol, data = wine_sample)
## m2: lm(formula = as.numeric(quality) ~ alcohol + volatile.acidity,
## data = wine_sample)
## m3: lm(formula = as.numeric(quality) ~ alcohol + volatile.acidity +
## citric.acid, data = wine_sample)
## m4: lm(formula = as.numeric(quality) ~ alcohol + volatile.acidity +
## citric.acid + fixed.acidity, data = wine_sample)
##
## ============================================================================
##                        m1            m2            m3            m4
## ----------------------------------------------------------------------------
##   (Intercept)          0.229         1.108***      1.067***      0.651***
##                       (0.152)       (0.166)       (0.175)       (0.195)
##   alcohol              0.332***      0.297***      0.297***      0.306***
##                       (0.015)       (0.014)       (0.014)       (0.014)
##   volatile.acidity                  -0.987***     -0.945***     -1.027***
##                                     (0.088)       (0.105)       (0.106)
##   citric.acid                                      0.068        -0.313*
##                                                   (0.092)       (0.121)
##   fixed.acidity                                                  0.055***
##                                                                 (0.012)
## ----------------------------------------------------------------------------
##   R-squared            0.255         0.311         0.312         0.322
##   adj. R-squared       0.254         0.310         0.310         0.320
##   sigma                0.598         0.575         0.575         0.571
##   F                  518.376       342.540       228.474       179.355
##   p                    0.000         0.000         0.000         0.000
##   Log-likelihood   -1371.875     -1311.941     -1311.667     -1300.544
##   Deviance           541.720       500.589       500.408       493.128
##   AIC               2749.751      2631.882      2633.334      2613.087
##   BIC               2765.726      2653.183      2659.960      2645.038
##   N                 1518          1518          1518          1518
## ============================================================================
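One detail worth checking in the calls above is `as.numeric(quality)`. If `quality` was still a factor at that point (an assumption; the report does not show the code), `as.numeric()` returns the internal level codes rather than the original scores. That would shift the intercept by a constant while leaving the slopes unchanged. The m1 estimates seem consistent with this: at a typical alcohol level of about 10.4, `0.229 + 0.332 * 10.4` is roughly 3.7, not a score between 5 and 7. A minimal sketch of the behaviour, using made-up values:

```r
# A factor whose levels are the quality scores 3 to 8 (illustrative values).
quality_f <- factor(c(5, 6, 7, 5), levels = 3:8)

# as.numeric() on a factor returns level codes: the position in levels(),
# not the label itself.
codes <- as.numeric(quality_f)

# Converting via character recovers the original scores.
scores <- as.numeric(as.character(quality_f))

data.frame(label = as.character(quality_f), codes = codes, scores = scores)
```

If this is what happened, the coefficients for alcohol and the acidity attributes can be read as they are, but predictions would need the offset added back before comparing them with the 3 to 8 quality scale.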
This model predicts well only for the middle quality levels, which have more samples; predictions for the other levels are biased.
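This kind of bias can be checked by averaging the residuals within each true quality level. The sketch below uses simulated data (`sim` and its generating coefficients are assumptions, not the wine data) and a plain `lm()` fit.

```r
set.seed(7)
n <- 1518
# Simulated predictors and a three-level score; coefficients are arbitrary assumptions.
sim <- data.frame(
  alcohol          = rnorm(n, mean = 10.4, sd = 1.1),
  volatile.acidity = rnorm(n, mean = 0.53, sd = 0.18)
)
latent <- 5.7 + 0.3 * (sim$alcohol - 10.4) -
  1.0 * (sim$volatile.acidity - 0.53) + rnorm(n, sd = 0.6)
sim$quality <- pmin(pmax(round(latent), 5), 7)

fit <- lm(quality ~ alcohol + volatile.acidity, data = sim)

# Mean residual (actual minus predicted) per true quality level.
# Negative means over-prediction, positive means under-prediction.
bias_by_level <- tapply(residuals(fit), sim$quality, mean)
bias_by_level
```

A level that the model predicts well has a mean residual near zero; levels at the ends of the scale tend to be pulled toward the middle.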
The positive relationship between fixed.acidity and citric.acid and the negative relationship between volatile.acidity and citric.acid do not reinforce each other with respect to quality: in the plots the samples are widely dispersed, and it is hard to find boundaries that separate samples of different quality.
I am surprised that there are still so many outliers even after subsetting the data to the middle-quality samples.
I created a linear model on a subset of the dataset that includes only the middle-quality wines (5, 6, 7), since the other levels have very few samples. As more features are added, R-squared increases slightly or stays the same while the deviance decreases, which indicates a modestly better fit. Judging from the errors, this linear model can hardly predict poor or good quality. It may be better to split the data differently, with 80 percent of the samples at each quality level used for training and the remaining 20 percent for testing.
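A stratified 80/20 split of that kind could look like the following sketch. The quality column is rebuilt from the per-level counts reported earlier, and `train_idx` and `test_idx` are hypothetical names; sampling is done within each quality level so that every level appears in both sets.

```r
set.seed(123)
# Quality column rebuilt from the per-level counts reported earlier.
quality_all <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Row indices grouped by quality level.
idx_by_level <- split(seq_along(quality_all), quality_all)

# Draw 80 percent of each level for training (rounded down).
train_idx <- unlist(lapply(idx_by_level, function(idx) {
  idx[sample.int(length(idx), size = floor(0.8 * length(idx)))]
}), use.names = FALSE)
test_idx <- setdiff(seq_along(quality_all), train_idx)

# Per-level counts in each set.
table(quality_all[train_idx])
table(quality_all[test_idx])
```

Indexing with `sample.int()` rather than calling `sample(idx, ...)` directly avoids the surprise where `sample()` treats a length-one vector as an upper bound.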
This plot visualizes the correlation matrix and helps us find the attributes most correlated with quality, the feature we are interested in. It is also a convenient and direct way to find other strongly related attributes.
This plot demonstrates that volatile.acidity is linearly correlated with quality, which is in accordance with the dataset info: “the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste”. The quality score in this red wine dataset is the median of at least 3 evaluations made by wine experts, so the plot suggests that taste plays an important role in the experts’ judgement, which points me toward taste-related attributes for later analysis.
After the univariate, bivariate and multivariate analyses, I created a linear model of the most important attributes, and this plot shows its performance. The upper panel presents the training error and the lower panel the results on the testing data.
To be honest, I was not familiar with red wine and its properties (also because of my poor English), so it took me a long time to understand the dataset at the beginning.
During the exploration and analysis, I found that the data itself imposes a few limitations.
1. Samples of good or poor quality red wines are so rare that they behave like outliers, which makes it hard to draw fair conclusions.
2. From my understanding, the quality is the median of at least 3 experts’ evaluations. Were these samples rated by the same group of experts? Are the differences between experts large?
The analysis procedure itself went well. Plotting the distribution of each variable gave me a basic understanding of the dataset; the correlation matrix then helped to find the two factors most strongly related to quality; and after deeper bivariate analysis I chose four attributes to build the final linear model. I believe this workflow is clear and insightful, and it can be applied in future work to almost any dataset.